Scikit-Learn Pipelines

Reading time: ~45 minutes | Level: Intermediate-Advanced

The Data Leakage Bug

You trained a fraud detection model. Validation AUC was 0.97. In production, AUC dropped to 0.72. The model looked perfect in evaluation, then immediately degraded.

The culprit was a single line written during data preparation:

from sklearn.preprocessing import StandardScaler
import numpy as np

# WRONG -- this is data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X)   # fitted on ALL data including the test set

X_train_scaled = X_all_scaled[train_idx]
X_test_scaled  = X_all_scaled[test_idx]

# The scaler has seen the test set mean and variance.
# The model's validation metric is now optimistic -- it's been trained
# on information from the future (test set statistics).

The fix is mechanical when you use Pipelines:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The Pipeline fits the scaler ONLY on training data during cross-validation.
# The scaler is applied (not re-fitted) when transforming the test fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    LogisticRegression()),
])
pipe.fit(X_train, y_train)    # scaler.fit_transform(X_train), clf.fit(X_train_scaled, y_train)
pipe.predict(X_test)          # scaler.transform(X_test),       clf.predict(X_test_scaled)

This lesson is about making Pipelines a natural first instinct -- not an afterthought.

Why This Matters

Pipelines are not a convenience feature. They are the mechanism by which preprocessing becomes part of the model rather than a separate, error-prone script. Without a Pipeline:

Every preprocessing step must be repeated manually at inference time (and often diverges from training)
Cross-validation leaks test-fold statistics into transformers fitted on the full training set
Hyperparameter search across preprocessing choices requires manual bookkeeping
Serialising the model for deployment requires serialising multiple separate objects

With a Pipeline, you get a single object that is correct-by-construction, safe to cross-validate, easy to deploy, and trivial to serialise.

1. Pipeline Internals: The fit/transform Protocol

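Every step in a Pipeline except the last must be a transformer: it must implement fit and transform (fit_transform is used when it exists, and TransformerMixin supplies a default). The last step only needs fit; the Pipeline then exposes whichever of predict, predict_proba, transform, or score that last step provides.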

The key insight: fit_transform is only called during pipe.fit(). During pipe.predict() and pipe.transform(), only transform() is called on intermediate steps -- the fitted state (learned mean, variance, components, etc.) is frozen.
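One way to see this protocol in action is to log the calls that an intermediate step receives. The sketch below is illustrative only: RecordingScaler and call_log are hypothetical names introduced here, not part of sklearn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

call_log = []  # hypothetical module-level record of method calls


class RecordingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that records fit/transform calls."""

    def fit(self, X, y=None, sample_weight=None):
        call_log.append("fit")
        return super().fit(X, y, sample_weight)

    def transform(self, X, copy=None):
        call_log.append("transform")
        return super().transform(X, copy)


rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(40, 3)), rng.normal(size=(10, 3))
y_train = np.array([0, 1] * 20)

pipe = Pipeline([("scaler", RecordingScaler()), ("clf", LogisticRegression())])

pipe.fit(X_train, y_train)
calls_after_fit = list(call_log)       # fit_transform runs fit, then transform

pipe.predict(X_test)
calls_after_predict = list(call_log)   # predict adds a transform, never a fit

print(calls_after_fit)
print(calls_after_predict)
```

The log gains a "fit" entry only while the Pipeline itself is being fitted; prediction reuses the frozen scaler state.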

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import numpy as np

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca",    PCA(n_components=10, random_state=42)),
    ("clf",    RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Access intermediate steps by name
print(pipe.named_steps["scaler"])   # StandardScaler()
print(pipe["scaler"])               # same, using dict-like access (sklearn >= 0.21)

# After fitting, inspect the fitted state of any step
X_dummy = np.random.randn(100, 20)
y_dummy = np.random.randint(0, 2, 100)
pipe.fit(X_dummy, y_dummy)

print(pipe["scaler"].mean_[:5])     # per-feature means, fitted on X_dummy
print(pipe["pca"].explained_variance_ratio_[:5])

# Access transformed output at any intermediate stage
X_after_scaler = pipe["scaler"].transform(X_dummy)
X_after_pca    = pipe[:-1].transform(X_dummy)   # all steps except the last

2. ColumnTransformer: Heterogeneous Data

Real ML datasets rarely have uniform feature types. You have numerical features that need scaling, categorical features that need encoding, and text features that need vectorisation. ColumnTransformer applies different transformers to different column subsets in parallel, then horizontally concatenates the results.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Realistic tabular dataset
df = pd.DataFrame({
    "age":        [25, 32, np.nan, 41, 28],
    "income":     [40000, 75000, 60000, np.nan, 55000],
    "gender":     ["M", "F", "F", "M", "M"],
    "education":  ["Bachelor", "Master", "PhD", "Bachelor", "Master"],
    "city":       ["NYC", "LA", "NYC", "Chicago", "LA"],
    "default":    [0, 0, 1, 0, 1],
})

X = df.drop(columns="default")
y = df["default"].values

# Define column groups by type
numerical_cols   = ["age", "income"]
ordinal_cols     = ["education"]
nominal_cols     = ["gender", "city"]

# Sub-pipelines for each group
numerical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),    # robust to outliers
    ("scaler",  StandardScaler()),
])

ordinal_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(
        categories=[["Bachelor", "Master", "PhD"]],  # explicit order matters
        handle_unknown="use_encoded_value",
        unknown_value=-1,
    )),
])

nominal_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(
        handle_unknown="ignore",   # silently ignore unseen categories at inference
        sparse_output=False,       # return dense array (sklearn >= 1.2)
    )),
])

# Combine into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_pipe, numerical_cols),
        ("ord", ordinal_pipe,   ordinal_cols),
        ("nom", nominal_pipe,   nominal_cols),
    ],
    remainder="drop",      # drop any columns not listed above
    verbose_feature_names_out=False,   # cleaner feature names
)

# The full Pipeline
from sklearn.linear_model import LogisticRegression

full_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("clf",          LogisticRegression(max_iter=1000, random_state=42)),
])

full_pipe.fit(X, y)

# Inspect feature names out of the preprocessor
feature_names = full_pipe["preprocessor"].get_feature_names_out()
print(feature_names)
# ['age', 'income', 'education', 'gender_F', 'gender_M', 'city_Chicago', 'city_LA', 'city_NYC']

Why remainder="drop"? Explicitly listing all columns forces you to think about every feature. The alternative remainder="passthrough" silently passes unlisted columns through unchanged -- dangerous if a column leaks target information.

3. Custom Transformers

The real power of the Pipeline abstraction is that any class implementing fit, transform, and fit_transform can be a step. Sklearn provides BaseEstimator and TransformerMixin as mix-ins that give you get_params, set_params, and a default fit_transform = fit + transform.
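As a minimal sketch of what the two mix-ins provide, consider a toy transformer (Multiplier is a hypothetical class used only for illustration). The mix-in is listed before BaseEstimator, which is the order the sklearn developer guide recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class Multiplier(TransformerMixin, BaseEstimator):
    """Hypothetical toy transformer: multiplies every value by `factor`."""

    def __init__(self, factor=2.0):
        self.factor = factor          # attribute name matches the __init__ argument

    def fit(self, X, y=None):
        return self                   # nothing to learn

    def transform(self, X):
        return np.asarray(X) * self.factor


m = Multiplier(factor=3.0)
params = m.get_params()                        # discovered from the __init__ signature
scaled = clone(m).set_params(factor=5.0)       # unfitted copy with one parameter changed
out = scaled.fit_transform(np.ones((2, 2)))    # fit_transform comes from TransformerMixin

print(params)
print(out)
```

Because get_params and set_params work, the same class can be cloned by cross-validation and tuned by grid search without any extra code.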

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

# Mixins go before BaseEstimator, the order recommended by the sklearn developer guide.
class LogTransformer(TransformerMixin, BaseEstimator):
    """
    Applies a signed log1p to right-skewed numerical features.

    Why not just use FunctionTransformer?
    Because this class supports feature-name-aware output via get_feature_names_out,
    stores the number of features it was fitted on (for validation at transform time),
    and can be serialised and inspected like any sklearn estimator.
    """

    def __init__(self, add_original: bool = False) -> None:
        # All hyperparameters must be stored as attributes with the same name.
        # BaseEstimator.get_params() uses inspect to discover them from __init__.
        self.add_original = add_original

    def fit(self, X: np.ndarray, y=None) -> "LogTransformer":
        # check_array is sklearn's public validator (2-D, numeric, no NaN/inf).
        # The private BaseEstimator._validate_data helper is avoided because it
        # is not available in every sklearn version.
        X = check_array(X)
        self.n_features_in_ = X.shape[1]   # remembered for validation at transform time
        return self

    def transform(self, X: np.ndarray, y=None) -> np.ndarray:
        check_is_fitted(self, "n_features_in_")
        X = check_array(X)
        # Check that X has the same number of features as during fit
        if X.shape[1] != self.n_features_in_:
            raise ValueError(f"Expected {self.n_features_in_} features, got {X.shape[1]}.")

        log_X = np.log1p(np.abs(X)) * np.sign(X)   # signed log for negative values

        if self.add_original:
            # Append original features as additional columns
            return np.hstack([log_X, X])

        return log_X

    def get_feature_names_out(self, input_features=None) -> np.ndarray:
        names = [f"log_{f}" for f in self._get_feature_names(input_features)]
        if self.add_original:
            orig = list(self._get_feature_names(input_features))
            names += orig
        return np.array(names)

    def _get_feature_names(self, input_features) -> list[str]:
        if input_features is not None:
            return list(input_features)
        return [f"x{i}" for i in range(self.n_features_in_)]


class WinsorisationTransformer(TransformerMixin, BaseEstimator):
    """
    Clips features to [lower_quantile, upper_quantile] of the TRAINING distribution.

    This is the correct way to handle outliers: the clip bounds are learnt from
    training data only, then applied to test data without re-fitting.
    """

    def __init__(self, lower: float = 0.01, upper: float = 0.99) -> None:
        self.lower = lower
        self.upper = upper

    def fit(self, X: np.ndarray, y=None) -> "WinsorisationTransformer":
        X = check_array(X)
        self.n_features_in_ = X.shape[1]
        # Compute bounds per feature from training data
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X: np.ndarray, y=None) -> np.ndarray:
        check_is_fitted(self, "lower_bounds_")
        X = check_array(X)
        if X.shape[1] != self.n_features_in_:
            raise ValueError(f"Expected {self.n_features_in_} features, got {X.shape[1]}.")
        # Clip using bounds learnt at fit time -- not re-computed from test data
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


# Compose custom transformers in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.lognormal(0, 1, size=(500, 6))   # right-skewed features
y = rng.integers(0, 2, size=500)

pipe = Pipeline([
    ("winsorise",  WinsorisationTransformer(lower=0.02, upper=0.98)),
    ("log",        LogTransformer(add_original=False)),
    ("clf",        GradientBoostingClassifier(random_state=42)),
])

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

4. FeatureUnion: Parallel Feature Extraction

FeatureUnion applies multiple transformers in parallel and concatenates their outputs. The classic use case is combining TF-IDF features from text with handcrafted numeric features.

import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(TransformerMixin, BaseEstimator):   # mixins first, BaseEstimator last
    """Selects a single text column from a DataFrame."""
    def __init__(self, key: str) -> None:
        self.key = key

    def fit(self, X, y=None): return self
    def transform(self, X): return X[self.key].fillna("")

class NumericSelector(TransformerMixin, BaseEstimator):   # mixins first, BaseEstimator last
    """Selects numeric columns from a DataFrame."""
    def __init__(self, keys: list[str]) -> None:
        self.keys = keys

    def fit(self, X, y=None): return self
    def transform(self, X): return X[self.keys].fillna(0).values

# Text branch: extract TF-IDF features, then reduce with SVD (LSA)
text_branch = Pipeline([
    ("selector",  TextSelector(key="review_text")),
    ("tfidf",     TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("svd",       TruncatedSVD(n_components=50, random_state=42)),  # dense 50-d
    ("scaler",    StandardScaler()),
])

# Numeric branch: scale structured features
numeric_branch = Pipeline([
    ("selector", NumericSelector(keys=["word_count", "rating", "verified_purchase"])),
    ("scaler",   StandardScaler()),
])

# Combine both branches in parallel
combined_features = FeatureUnion([
    ("text",    text_branch),
    ("numeric", numeric_branch),
])

# Full Pipeline
from sklearn.linear_model import LogisticRegression

full_pipe = Pipeline([
    ("features", combined_features),
    ("clf",      LogisticRegression(C=1.0, max_iter=1000)),
])

# The Pipeline handles the DataFrame -> features -> predictions chain correctly,
# with no manual array management.

FeatureUnion vs ColumnTransformer: prefer ColumnTransformer for tabular data (it handles DataFrames natively and is more explicit about column selection). Use FeatureUnion when you need truly parallel pipelines with heterogeneous input types (e.g., combining image and text features).

5. Pipeline + GridSearchCV: Correct Hyperparameter Search

The Pipeline's most important property: it can be passed directly to GridSearchCV or RandomizedSearchCV, and sklearn ensures the transformer is fitted only on the training fold, not the validation fold.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import numpy as np

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca",    PCA(random_state=42)),
    ("svm",    SVC(probability=True)),
])

# Hyperparameter grid: use double underscore to target steps by name
param_grid = {
    "pca__n_components": [5, 10, 20],       # PCA hyperparameter
    "svm__C":            [0.1, 1.0, 10.0],  # SVM regularisation
    "svm__kernel":       ["rbf", "linear"],  # SVM kernel
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    pipe,
    param_grid,
    cv=cv,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=1,
    refit=True,   # after search, refit best config on entire training set
)

# Generate toy data
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 25))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

search.fit(X, y)
print(f"Best params:  {search.best_params_}")
print(f"Best CV AUC:  {search.best_score_:.4f}")

# search.best_estimator_ is a full Pipeline, ready for inference
best_pipe = search.best_estimator_
proba = best_pipe.predict_proba(X[:5])

What safe CV with Pipelines actually prevents: if you scale the data manually and then run GridSearchCV on the scaled array, StandardScaler.fit() has already seen every row, so in each CV split the validation fold contributed to the mean and variance used to scale the training fold. That is leakage. When the scaler is inside the Pipeline, sklearn clones and refits it for each split, calling scaler.fit_transform(X_train_fold) and scaler.transform(X_val_fold) -- the validation fold never influences the scaler.
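Leakage from scaling is usually small, so the sketch below swaps in a supervised step, SelectKBest, where the same mistake is far more visible. This is a deliberate substitution for illustration, not the scaler case itself. The data is pure noise, so an honest estimate should sit near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # pure noise: no feature is truly predictive
y = np.array([0, 1] * 50)          # balanced labels, independent of X

# Leaky: the selector sees every label, including those of future validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Safe: the selector is refitted on the training part of every split
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_acc:.2f}")
print(f"honest CV accuracy: {honest_acc:.2f}")
```

The leaky score is inflated because the 20 selected columns were chosen using the very labels they are later evaluated on; the Pipeline version repeats the selection inside each training fold.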

6. The set_output API (sklearn >= 1.2)

Before sklearn 1.2, transformers returned NumPy arrays (or sparse matrices) even when given a DataFrame, so column names were lost between steps and had to be recovered separately with get_feature_names_out. The set_output API lets transformers return DataFrames, preserving column names end-to-end.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 32, 41, 28, 35],
    "income": [40000.0, 75000.0, 60000.0, 55000.0, 68000.0],
    "gender": ["M", "F", "F", "M", "F"],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(sparse_output=False), ["gender"]),
])

# Enable DataFrame output for this ColumnTransformer and its sub-transformers
preprocessor.set_output(transform="pandas")

X_transformed = preprocessor.fit_transform(df)
print(type(X_transformed))   # pandas.core.frame.DataFrame
print(X_transformed.columns.tolist())
# ['num__age', 'num__income', 'cat__gender_F', 'cat__gender_M']
# (the prefixes come from the default verbose_feature_names_out=True)

This makes debugging transformations far easier: pipe[:-1].transform(X) returns a labelled DataFrame instead of an anonymous array.

7. Pipeline Persistence with joblib

A Pipeline is a single serialisable object. Save it once; load it anywhere.

import joblib
from pathlib import Path
from sklearn.pipeline import Pipeline

def save_pipeline(pipe: Pipeline, path: str | Path) -> None:
    """
    Persists a fitted Pipeline using joblib.
    joblib is commonly preferred over plain pickle for sklearn objects because
    it stores large numpy arrays efficiently. It is still pickle-based, so only
    load files that come from a source you trust.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, path, compress=3)   # compress=3 is a good size/speed tradeoff
    print(f"Pipeline saved to {path}  ({path.stat().st_size / 1024:.1f} KB)")

def load_pipeline(path: str | Path) -> Pipeline:
    """Loads a fitted Pipeline from disk."""
    pipe = joblib.load(path)
    print(f"Loaded pipeline: {pipe.steps}")
    return pipe


# --- Training side ---
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = rng.integers(0, 2, size=200)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    LogisticRegression()),
])
pipe.fit(X_train, y_train)
save_pipeline(pipe, "models/baseline_v1.joblib")

# --- Inference side (different process / server) ---
loaded_pipe = load_pipeline("models/baseline_v1.joblib")
X_new = rng.normal(size=(5, 10))
predictions = loaded_pipe.predict(X_new)
print(predictions)

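Version pinning: a saved Pipeline stores the fitted state produced by the sklearn version that serialised it, but it runs on whatever sklearn version is installed where it is loaded. Always record the sklearn version alongside the saved model file and load it in a matching environment. sklearn does not guarantee that a model loads or behaves identically across versions; a mismatch may raise an error, emit a warning, or silently produce wrong predictions.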
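One possible way to record and check the version is a small metadata file next to the model. This is a sketch under assumptions: the file names model.joblib and meta.json and the strict equality check are illustrative choices, not an sklearn convention.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 4)), np.array([0, 1] * 25)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())]).fit(X, y)

# --- Training side: write the model and its metadata together ---
model_dir = Path(tempfile.mkdtemp())           # throwaway directory for the demo
joblib.dump(pipe, model_dir / "model.joblib")  # hypothetical file name
(model_dir / "meta.json").write_text(
    json.dumps({"sklearn_version": sklearn.__version__})
)

# --- Inference side: check the recorded version before unpickling ---
meta = json.loads((model_dir / "meta.json").read_text())
version_matches = meta["sklearn_version"] == sklearn.__version__
if not version_matches:
    raise RuntimeError(
        f"Model saved with sklearn {meta['sklearn_version']}, "
        f"running {sklearn.__version__}"
    )

loaded = joblib.load(model_dir / "model.joblib")
same_predictions = np.array_equal(loaded.predict(X), pipe.predict(X))
print(version_matches, same_predictions)
```

Whether a mismatch should raise or only warn is a policy decision; raising is the conservative default for a production service.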

8. Custom Estimator for Business Logic

Sometimes the final step in a Pipeline is not a standard sklearn estimator. You need a class that wraps business logic: threshold adjustment, post-processing, or an external model.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_is_fitted, check_X_y

# Mixins go before BaseEstimator, the order recommended by the sklearn developer guide.
class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """
    Wraps any probability-outputting classifier and applies a custom
    decision threshold instead of the default 0.5.

    Use case: in fraud detection you may want threshold=0.2 to maximise
    recall at the cost of precision -- this is a business decision, not
    a statistical one, and it belongs in the Pipeline.
    """

    def __init__(self, base_clf, threshold: float = 0.5) -> None:
        self.base_clf  = base_clf
        self.threshold = threshold

    def fit(self, X: np.ndarray, y: np.ndarray) -> "ThresholdClassifier":
        X, y = check_X_y(X, y)
        # Fit a clone so the estimator passed to __init__ is never mutated;
        # by sklearn convention, fitted state lives in trailing-underscore attributes.
        self.base_clf_ = clone(self.base_clf).fit(X, y)
        self.classes_ = self.base_clf_.classes_
        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        check_is_fitted(self)
        return self.base_clf_.predict_proba(X)

    def predict(self, X: np.ndarray) -> np.ndarray:
        check_is_fitted(self)
        proba = self.predict_proba(X)[:, 1]   # probability of classes_[1] (binary problems only)
        # Map the thresholded decision back to the original class labels
        return self.classes_[(proba >= self.threshold).astype(int)]

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """Default score is accuracy at the custom threshold."""
        return (self.predict(X) == y).mean()


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    ThresholdClassifier(
                   base_clf=LogisticRegression(max_iter=1000),
                   threshold=0.25,   # capture more fraud at the cost of false positives
               )),
])

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = rng.binomial(1, 0.15, size=300)   # 15% positive (fraud-like imbalance)

pipe.fit(X, y)
preds = pipe.predict(X[:5])
print(preds)

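# The threshold is a Pipeline hyperparameter -- searchable with GridSearchCV.
# Caveat: recall can only stay equal or rise as the threshold falls, so
# scoring="recall" alone favours the lowest threshold in the grid. In practice,
# pair it with a metric that penalises false positives (for example F-beta).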
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(
    pipe,
    param_grid={"clf__threshold": [0.2, 0.25, 0.3, 0.4]},
    scoring="recall",   # optimise for catching fraud
    cv=5,
)
search.fit(X, y)
print(f"Best threshold: {search.best_params_['clf__threshold']}")

9. Production Patterns

Pattern 1: Train/Test Split OUTSIDE the Pipeline

from sklearn.model_selection import train_test_split

# Always split BEFORE building the pipeline.
# The pipeline itself handles fit/transform separation during CV.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)

Pattern 2: Feature Names Through the Full Pipeline

import pandas as pd

# Assumes `pipe` is a fitted Pipeline whose final step ("clf") is a tree-based
# model and whose earlier steps all implement get_feature_names_out.
# pipe[:-1] is the fitted preprocessing chain, so the names line up with the
# columns the classifier actually saw.
output_names = pipe[:-1].get_feature_names_out()

# Map these back to tree model feature importances
importances = pipe["clf"].feature_importances_
importance_df = pd.DataFrame({
    "feature":    output_names,
    "importance": importances,
}).sort_values("importance", ascending=False)

Pattern 3: Pipeline Cloning for Safe Experiment Tracking

from sklearn.base import clone

base_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    LogisticRegression()),
])

# clone() creates an unfitted copy with the same hyperparameters.
# Use this to run experiments from a shared baseline without mutation.
experiment_pipe = clone(base_pipe)
experiment_pipe.set_params(clf__C=10.0)
experiment_pipe.fit(X_train, y_train)
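The two guarantees of clone(), an unfitted copy and an untouched original, can be verified directly. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(60, 3)), np.array([0, 1] * 30)

base_pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
base_pipe.fit(X_train, y_train)

# clone() copies hyperparameters only; fitted attributes are not carried over
experiment_pipe = clone(base_pipe).set_params(clf__C=10.0)

clone_is_unfitted = not hasattr(experiment_pipe["clf"], "coef_")
base_still_fitted = hasattr(base_pipe["clf"], "coef_")
base_C = base_pipe.get_params()["clf__C"]              # unchanged default
experiment_C = experiment_pipe.get_params()["clf__C"]  # the overridden value

print(clone_is_unfitted, base_still_fitted, base_C, experiment_C)
```

Calling set_params directly on base_pipe would instead change the shared baseline for every later experiment.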

Pattern 4: Partial Refit at Inference Time

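# If new categories appear in production, you may need to replace the encoder.
# set_params can swap a step without rebuilding the Pipeline. The parameter path
# below assumes a ColumnTransformer step named "preprocessor" that contains a
# sub-pipeline named "cat" with an "encoder" step.

from sklearn.preprocessing import OrdinalEncoder

pipe.set_params(preprocessor__cat__encoder=OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
))
# A Pipeline has no partial refit: fit() refits every step, which is required
# here because downstream steps depend on the encoder's output.
# X_new, y_new are assumed to be labelled data that includes the new categories.
pipe.fit(X_new, y_new)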

10. Common Mistakes

Mistake 1: Fitting transformers before the Pipeline

# BAD: scaler is fitted on ALL data before any train/test split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all)   # test data contaminated

X_train = X_scaled[:800]
X_test  = X_scaled[800:]

# GOOD: the scaler is inside the Pipeline, fitted only during .fit()
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)
pipe.predict(X_test)

Mistake 2: Using fit_transform at inference time

# BAD: re-fits the scaler on inference data -- different mean/std than training
X_new_scaled = scaler.fit_transform(X_new)   # scaler state overwritten!

# GOOD: only transform at inference
X_new_scaled = scaler.transform(X_new)       # uses training-time mean/std

# With a Pipeline, this is impossible to get wrong -- pipe.predict() always
# calls transform(), never fit_transform(), on intermediate steps.
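The size of the error is easy to see on a toy example where the inference data has drifted upward. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])      # training mean is 2.0
X_new = np.array([[10.0], [12.0], [14.0]])     # shifted relative to training

scaler = StandardScaler().fit(X_train)

# Correct: training-time statistics, so the shift stays visible to the model
correct = scaler.transform(X_new)

# Wrong: statistics re-estimated from the new batch, so the shift is erased
wrong = StandardScaler().fit_transform(X_new)

print(correct.ravel())
print(wrong.ravel())
```

After the incorrect call the new batch is centred on zero, so the model receives inputs that look like ordinary training data even though the raw values have moved far outside the training range.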

Mistake 3: Not using make_pipeline for quick prototypes

from sklearn.pipeline import make_pipeline

# make_pipeline infers step names automatically (lowercase class names).
# Use it for quick prototyping; use Pipeline([...]) when you need named access.
pipe = make_pipeline(StandardScaler(), PCA(10), LogisticRegression())
# Step names: 'standardscaler', 'pca', 'logisticregression'
print(pipe.named_steps.keys())

Mistake 4: Forgetting remainder in ColumnTransformer

# If a DataFrame column is NOT listed in any transformer, the default
# remainder="drop" silently drops it. This is often what you want -- but
# remainder="passthrough" silently includes it unchanged, which can leak
# target-correlated columns you forgot to remove.
# Always be explicit about which columns exist and what to do with them.

Key Takeaways

A Pipeline is the atomic unit of an ML model. Training code, preprocessing, and inference code must be the same object. Never separate them.
Data leakage in CV is the silent cause of optimistic metrics. Any transformer fitted on the full dataset before CV has already seen every validation fold. Passing a Pipeline to GridSearchCV prevents this by design.
ColumnTransformer applies different preprocessing to different column types in parallel, then concatenates. Use it for every real-world tabular dataset.
Custom transformers inherit from BaseEstimator + TransformerMixin. Store all hyperparameters as __init__ parameters with matching attribute names; sklearn's grid search depends on this.
fit_transform is for training; transform is for inference. A Pipeline enforces this distinction automatically -- you cannot accidentally call fit_transform at inference time.
Serialise Pipelines with joblib.dump, not raw pickle. Record the sklearn version alongside every saved model.
The set_output(transform="pandas") API (sklearn >= 1.2) enables DataFrame output throughout the Pipeline, making debugging far easier.
Business logic (threshold adjustment, score calibration, post-processing) belongs in the Pipeline as a custom estimator step -- not in separate inference scripts.

Practice Problems

Problem 1 -- Leakage Audit: Take an existing preprocessing script that manually applies fit_transform before train/test split. Refactor it into a complete Pipeline. Measure the CV AUC before and after refactoring. The before-refactoring AUC should be higher (optimistically biased) than the post-refactoring AUC. Document the difference.

Problem 2 -- Custom Imputer: Write a GroupMedianImputer that imputes missing numerical values with the median of a groupby column (e.g. fill missing income with the median income for that person's city). The group medians must be computed only from training data. Integrate it into a ColumnTransformer Pipeline.

Problem 3 -- Text + Numeric Pipeline: Build a Pipeline for a sentiment classification task. The input is a DataFrame with a text column and three numeric columns (word_count, avg_word_length, exclamation_count). Use a FeatureUnion to combine TF-IDF (text branch) with StandardScaler (numeric branch). Run a RandomizedSearchCV over TF-IDF max_features, SVD n_components, and LogisticRegression C. Report the best validation AUC and the best hyperparameter combination.

Problem 4 -- Pipeline Versioning: Write a VersionedPipeline that wraps sklearn's Pipeline and adds: (a) a metadata dict storing sklearn version, python version, training date, and dataset hash; (b) a save method that writes the Pipeline and metadata to a directory; (c) a class method load that loads and validates that the sklearn version matches the current environment, raising a warning if it does not.

Problem 5 -- Calibrated Probability Pipeline: Wrap a GradientBoostingClassifier in a CalibratedClassifierCV (from sklearn.calibration) inside a Pipeline, and verify that the output probabilities are well-calibrated using a reliability diagram (fraction of positives vs mean predicted probability, binned). Compare calibration before and after the calibration wrapper.
